AITopics | vision-and-language navigation

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

Neural Information Processing Systems | Apr-25-2026, 12:58:01 GMT

A.1 Limitations. We propose an approach that exploits object features in addition to scene features for vision-and-language navigation (VLN). Our approach is able to use object features for better visiolinguistic alignment (see Section 5) despite the domain gap between the images used to train the object detector and the VLN data. Specifically, object features are obtained using a Faster R-CNN detector [1] trained on photos from the web (Visual Genome [2]), in which objects are typically well framed by the photographer. In contrast, the VLN datasets used in our experiments contain panoramic images from indoor house scans that capture objects at viewing angles determined by the navigation path. The gap between these two types of data could be eliminated by either fine-tuning the detector or training it directly on indoor scenes.

agent, artificial intelligence, a scene- and object-aware transformer, (13 more...)

Technology: Information Technology > Artificial Intelligence (1.00)

3c8a49145944fed2bbcaade178a426c4-Paper.pdf

Neural Information Processing Systems | Apr-25-2026, 12:57:58 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Country: North America > United States (0.28)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

2e5c2cb8d13e8fba78d95211440ba326-Paper.pdf

Neural Information Processing Systems | Apr-25-2026, 07:44:51 GMT

artificial intelligence, machine learning, natural language, (17 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

28f699175783a2c828ae74d53dd3da20-Paper-Conference.pdf

Neural Information Processing Systems | Apr-25-2026, 04:57:36 GMT

Recent years have seen embodied visual navigation advance in two distinct directions: (i) in equipping the AI agent to follow natural language instructions, and (ii) in making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal, but also often complex, and thus in spite of these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate.

artificial intelligence, machine learning, natural language, (13 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

0d9e08f247ca7fbbfd5e50b7ff9cf357-Paper-Conference.pdf

Neural Information Processing Systems | Apr-24-2026, 19:35:03 GMT

machine learning, natural language, navigation, (15 more...)

Country: North America > United States (0.34)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.31)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

0602940f23884f782058efac46f64b0f-Supplemental.pdf

Neural Information Processing Systems | Apr-24-2026, 11:51:41 GMT

artificial intelligence, machine learning, natural language, (16 more...)

Industry: Law (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.70)

Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

Neural Information Processing Systems | Apr-24-2026, 11:51:36 GMT

In the Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments by following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately. In this paper, we address the cross-modal alignment challenge from a fine-grained perspective. First, to alleviate weak cross-modal alignment supervision from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset, namely Landmark-RxR. Second, to further enhance local cross-modal alignment under fine-grained supervision, we investigate focal-oriented rewards in soft and hard forms, focusing on the critical points sampled from the fine-grained Landmark-RxR. Moreover, to fully evaluate the navigation process, we also propose a re-initialization mechanism that makes metrics insensitive to difficult points, which can cause the agent to deviate from the correct trajectories. Experimental results show that our agent has superior navigation performance on Landmark-RxR, en-RxR and R2R.

artificial intelligence, machine learning, natural language, (16 more...)

Country: Asia > China (0.14)

Genre: Research Report (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Neural Information Processing Systems | Mar-22-2026, 16:04:28 GMT

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.

artificial intelligence, proceedings, vision-and-language navigation, (7 more...)

Technology: Information Technology > Artificial Intelligence (0.39)

Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation

Neural Information Processing Systems | Feb-19-2026, 08:36:04 GMT

The ability to perform effective planning is crucial for building an instruction-following agent.

artificial intelligence, machine learning, navigation, (17 more...)

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions Heng Li

Neural Information Processing Systems | Feb-18-2026, 08:05:06 GMT

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions.

large language model, machine learning, natural language, (18 more...)

Country: